Online Optimization of Product-Form Networks 
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Abstract — We develop an online gradient algorithm for op- 
timizing the performance of product-form networks through 
online adjustment of control parameters. The use of standard 
algorithms for finding optimal parameter settings is hampered by 
the prohibitive computational burden of calculating the gradient 
in terms of the stationary probabilities. The proposed approach 
instead relies on measuring empirical frequencies of the various 
states through simulation or online operation so as to obtain 
estimates for the gradient. Besides the reduction in computational 
effort, a further benefit of the online operation lies in the 
natural adaptation to slow variations in ambient parameters as 
commonly occurring in dynamic environments. On the downside, 
the measurements result in inherently noisy and biased estimates. 
We exploit mixing time results in order to overcome the impact 
of the bias and establish sufficient conditions for convergence to 
a globally optimal solution. 

Index Terms — Gradient algorithm, Markov processes, mixing 
times, online performance optimization, product-form networks, 
stochastic approximation, dynamic control. 

I. Introduction 

Markov processes provide a versatile framework for mod- 
elling a wide variety of stochastic systems, ranging from 
communication networks and data center applications to con- 
tent dissemination systems and physical or social interaction 
processes [1|, [2|, [3]. In particular, key performance measures 
of the system under consideration, e.g. buffer occupancies, 
response times, loss probabilities or user throughputs, can 
typically be expressed in terms of the stationary distribution it 
of the Markov process. 

In many applications, the stationary distribution it, and 
hence the performance measures or statistical properties, cru- 
cially depend on system parameters r that can be controlled, 
e.g. admission thresholds, service rates, link weights or re- 
source capacities. In those cases, the interest is often not so 
much in evaluating the performance of the system for given 
parameter values, but rather in finding parameter settings r opt 
that optimize the performance or achieve an optimal trade-off 
between service level and costs. 

Specifically, let u(7r(r)) be a function expressing the perfor- 
mance objective (to be minimized) in terms of the stationary 
distribution ir(r) as function of the system parameters r and 
let c(r) be a function representing possible cost associated 
with r, e.g. capital expense or power consumption. Introducing 
u(r) — u(n(r)) + c(r), the problem of interest may then be 



mathematically formulated as finding 



»opt _ 



argmin ii(r). 



(1) 



It is worth observing here that the problem formulation differs 
from the typical Markov decision processes ||4), 0, which 
focus on selecting optimal actions in various states rather than 
identifying optimal parameter values. 

Optimization problem (fl]i could in principle be solved using 
mathematical programming approaches such as gradient-based 
schemes. In addition to the usual convexity issues, however, 
a further difficulty arises from the fact that the stationary 
distribution ir(r) is only implicitly determined as a function 
of r by the balance equations and is rarely available in explicit 
form, which severely complicates both the evaluation of the 
objective function u(r) and calculation of its gradient V r u(r). 

In the present paper we develop a gradient approach to solve 
the optimization problem ((TJ for a class of Markov processes 
with product-form distributions. This class of processes arises 
in a rich family of stochastic models, such as loss networks 
dD, 0, open and closed queueing networks HO, 0, wire- 
less random-access networks iflOl . ifTTl and various types of 
interacting -particle systems (TJ, 0. 

As we will show, the partial derivatives dir{r)/dr for this 
class of processes can be written as linear combinations of 
products of stationary probabilities 7r(r), thus reducing the 
computation of the gradient to the evaluation of the equilib- 
rium distribution. The problem that yet remains in many situa- 
tions is that the stationary probabilities involve a normalization 
constant whose calculation is computationally intensive and 
potentially NP-hard lfT2l . This issue is particularly pertinent 
in the context of iterative optimization algorithms such as 
gradient-based schemes, where partial derivatives need to be 
calculated repeatedly. 

In order to circumvent the computational burden of cal- 
culating the stationary probabilities, we adopt a gradient 
approach which relies on measuring the empirical frequencies 
of the various states so as to estimate the partial derivatives. 
Specifically, in each iteration we observe the stochastic process 
for some time period through simulation or online operation, 
and we then calculate estimates for the gradient based on 
the measured time fractions of the various states. Although 
the number of states may be extremely large, it turns out 
that in many situations one only needs to track the time 



fractions of aggregate states rather than all individual states, 
and that these aggregate states can be observed in an entirely 
distributed fashion. Besides the reduction in computational 
effort, a further benefit of the online operation lies in the fact 
that the algorithm will automatically adapt to slow variations 
in ambient parameters which are fairly common in dynamic 
environments. 

While the measurements bypass the computational effort of 
calculating the stationary probabilities, they result in inherently 
noisy and biased estimates for the gradient. The issue of noisy 
estimates is paramount in the field of stochastic approxima- 
tion, where years of research have resulted in many robust 
stochastic approximation schemes which can cope with various 
stochastic processes and forms of random noise |[T3l , fBl . 
In contrast, biased estimates present a much trickier issue, 
which is usually not accounted for in stochastic approximation 
schemes. In order to neutralize the impact of the bias, we focus 
the attention on the family of reversible processes within the 
above-mentioned class of Markov processes with product-form 
distributions ||9). For reversible processes, powerful results 
are known for mixing times fl5l . fl6l . which allow us to 
derive sufficient conditions guaranteeing convergence to the 
optimal solution of ([T). Intuitively, the mixing times provide 
an indication for the period of time that we need to observe 
the stochastic process in order to overcome the impact of the 
bias. 

As a further condition to ensure convergence to the globally 
optimal solution of (Q]) rather than a possible local optimum, 
we assume the optimization objective u(r) to be convex in r. 
While convexity is generally non-trivial to establish, this can 
be easily verified for the broad class of so-called log-likelihood 
functions 

u(r) = w(7r(r)) = — a T lri7r(r) = — a x lri7r x (r), (2) 

where £1 denotes the state space of the process, a x are 
fixed coefficients and n x (r) is the stationary probability of 
state x. Taking partial derivatives of we find that the 
first-order conditions reduce to linear constraints in terms 
of the stationary probabilities. In other words, the problem 
of attaining target values for expectations of functionals of 
the stationary distribution can be cast as an optimization 
objective of the form ©. A special case of (f2]i was recently 
investigated by Jiang and Walrand |fT71 , lfl8l . Their goal was to 
achieve target throughput values in CSMA networks by using 
an algorithm that adjusts the access or backoff parameters 
(represented by the vector r in (f2]l) using empirical arrival 
and service rates. This in fact provided valuable inspiration 
for the work presented here, where we extend the scope of 
such algorithms to general product-form Markov processes 
and a larger class of objective functions. These generalizations 
require a different approach to deal with the impact of bias, 
as discussed in mW-B 1 1 

Further important related work is done by Marbach and 
Tsitsiklis [19], ll20l . see also ll2l | for further background. In 
[19 1, [20 1, an algorithm similar in spirit to ours is considered 



- an algorithm that aims to tackle a parameter optimization 
problem by relying on measurement-based evaluation of a 
gradient. Their convergence proof also involves analysis of 
noisy and biased estimates and the generic use of Lyapunov 
functions and martingale arguments. However, their expression 
for the gradient is fundamentally different and hence the 
specific proof arguments substantially differ as well. Although 
[19 1, [20 1 can be applied to more general Markov processes 
and furnishes greater versatility in use, it does not take advan- 
tage of simplifications that arise from the specific structure of 
product-form distributions as in this paper. Most importantly, 
however, the algorithm in ||l9l . [20 1 differs in its updating 
method, because it updates parameters whenever the process 
visits recurrent states. Knowing whether the entire system 
is in a recurrent state (and thus when to update) requires 
information about all components of the system, making the 
algorithm in 1(191 . [20 1 global in nature. This differs from 
our algorithm and that presented in ifTTl . [18|, which can be 
implemented in a distributed manner. 

The remainder of the paper is organized as follows. In 
£im we present a detailed problem formulation, develop our 
measurement-based optimization algorithm and state our main 
results. Some illustrative application scenarios are described 
next in pill In EHVI we first identify conditions in terms of 
the measurement noise and bias which ensure the convergence 
of the algorithm, and we then prove that these conditions are 
satisfied. 

II. Algorithm description 

Throughout this paper, we denote by bi the i-th compo- 
nent of vector b. When taking a scalar function of an n- 
dimensional vector b, we do this component-wise, i.e. exp b = 
(exp&i, exp&„) T . If we have a | il | -dimensional vector b 
in which each component corresponds to some state x g fi, 
we write b x for that component of b that corresponds to state 
x. Similarly, we denote by Aij the element in row i, column 
j of matrix A. If rows and/or columns correspond to states 
in il, we write instead. Finally, we denote by 1„ the 

n-dimensional vector of which all components equal one. 

A. Gradient scheme 

Consider a Markov process {X(t)} t >o that is irreducible, 
reversible and has a finite state space fL Let 7r(r) denote its 
steady-state probability vector as a function of d parameters 
r = (ri, ...,rd) T , which arises naturally if one has a closed- 
form expression for the stationary distribution. The most 
prominent examples are the product-form distributions 

Z[r) 

where A € Rl f2 l xd is a matrix, b e Rl n l is a vector and Z{r) 
is the normalization constant. 

We consider the optimization problem 

minu(r), (4) 



where u(r) denotes an objective function that we assume to 
be convex in r on a hypercube 7Z C representing the 
feasible range for the parameters r. We furthermore require 
that (HJi has a unique minimizer r° pt = argmin r6 7j, u(r), 
and we assume that the gradient of u(r) can be written as 
a function of n(r) and r, i.e. V r ti(r) = g(n(r),r) where 
V r = (d/dri, ...,d/drd) T ■ For example when c(r) = 0, the 
gradient of u(r) = u(7r(r)) can be written as 



du(n(r)) 

dr t 



E 



du(Tv(r)) dn x (r) 
dir x (r) dr t 



(5) 



for i = 1 , . . . , d. For the important case of product-form 
distributions in ||3), we have 

dir x (r) 1 



Z{r)A^i exp (Ar + b) a 



rx(r)( 



(6) 



Z(r) 2 

- exp (Ar + b) x A Vil exp (Ar + b) 
yen 

YA y .nT y {r) 

yen 

so that du(Tv(r)) / dri = ^(^(r)) and therefore V r u(r) = 
g(n(r)). While for this example the gradient can be written 
as a function of only 7r(r), in ijlll-AI we will encounter an 
example for which it is more efficient to write the gradient as 
a function of both 7r(r) and r. For a calculation of such partial 
derivatives in a more general case of product-form networks, 
we refer the reader to (22]. 

Our goal is to find r° pt and in order to do so, it is natural 
to consider the gradient algorithm 

r [n+l] _ r r [n] _ (7) 

where g^ n+1 ^ = g(n(r^),r^), and n £ N indexes the 
iteration. The £ (0, oo) denote the step sizes of the 
algorithm, and we define the truncation operator as follows. 

Definition 1. For Kcl d of the form 



n = [RT n , Hi**] x ... x [7C m , n*r x } , (8) 

the truncation [r] K £ M. d ofr£ M d is defined component-wise 
as 



[r]f = max{ft?^,min{7C ut , »•<}}. 



(9) 



B. Online gradient algorithm 

It is well known that under suitable assumptions on the 
objective function and step sizes, the gradient algorithm in 
(|7]i generates a sequence r' n l that converges to the optimal 
solution r° pt . We also come back to this at the end of 8 1IV-AI 
Calculating the gradient, however, may be difficult in practice, 
because it depends on 7r(r), limiting the applicability of <j7j. 

Instead of using ©, we will estimate 7r(r) by observing 
the evolution of the system. These observations will take place 
during time intervals [t [n \^ n+ % where = f [0] < < .... 
At the end of each interval, say at time P- n+1 \ our algorithm 
will change the current system parameters to new param- 
eters based on its observations. 



The stochastic process {Y(t)} t >o that describes the sys- 
tem is given by Y(t) = Z^- n \t), where n is such that 
t £ The process {Z( n ](i)} tW < t < ([ » +1] is a time- 

homogeneous Markov process, which starts in ^["-^(tM) 
and evolves according to the generator of {X(t)} t >o that 
corresponds to parameters 

Let us now make precise how our algorithm observes the 
system and makes decisions. At time t^ n+1 \ marking the end 
of observation period n + 1, we calculate 



nt" +1 i = 



£[n+l] _ f M J tM 



l[Z [n] {t) = x]dt (10) 



for every state x £ Q,. During each interval, one thus keeps 
track of the fractions of time that the system is in every 
state. This constitutes an empirical estimate of 7r(J2' n -'). We 
then estimate the gradient G [n+1] = g(n(R [n] ), R [n] ) by 

G [n+1] = g(n ln+1] ,R [n] ). If we then apply Q using the 
estimated gradient instead of the actual gradient, we are 
essentially using the stochastic gradient algorithm 



(ID 



to update the parameters. 

Note that algorithm (0 is deterministic, whereas ( fTTb is 
stochastic. Also note that because we are estimating the 
gradient instead of explicitly calculating it, the algorithm in 
dTTb is no longer guaranteed to converge to r opt . 

C. Main result 

We now present technical assumptions which will guarantee 
convergence of ( fTTb . For this, we need an additional sequence 
e N which we shall refer to as the error. It is related to the 
maximum allowable error when estimating the steady-state 
probability vector, which will be made precise in i jlV-B 1 1 
We require the sequences eJ™!, and /["I = 1/(^"1 — 



£[»-!]) to be such that 



oo 
n=l 



n=l 



(a M ) 2 < oo, 



and 



< oo, a M exp 

n=l 



n}\2 



4|f2| 2 K/[ r ' 



(12) 



<oo, (13) 



for any k £ (0,oo). We also require boundedness and 
regularity of g(7r(r), r), in the sense that there exist constants 



c\ £ [0, oo) such that 



\gi(v, r ) - gi(y, r )\ < ciW/j, - u\\ var for i = i,...,d, (14) 



!lfl , (/^^) 



< Cr, 



(15) 



for all probability vectors /x, u and all r £ 1Z. Here, \\ft — 
u\\ vaI — h X)a;eo \l 1 x ~ v x\ is the total variation distance. 
Under conditions ( TT2b - ( fT3T > and the assumptions in iffl-AI 
and ijH-Bl the following result holds. 



Theorem 1. The sequence generated by the online 

algorithm ( 1111 ) converges to the optimal solution r opt of the 
optimization problem © with probability one. 

Condition (fT2l i is typical in stochastic approximation. It 
ensures that step sizes become smaller as n increases, while re- 
maining large enough so that the algorithm does not get stuck 
in a suboptimal solution. Condition ( fT3] > then requires that the 
error &- n ' for which we allow when estimating the steady-state 
probability vector must decrease. In order to guarantee this, 
the observation frequency must eventually become smaller 
than the error, i.e. (e^) 2 //^ — > oo as n — > oo. Condition 
dT~4T > ensures that when we approximate the gradient of u(r) by 
using empirical distributions that come increasingly closer to 
the actual 7r(r), our approximation of the gradient also comes 
increasingly closer to the actual gradient. It is the most non- 
trivial of all conditions and verification can be cumbersome. 
In ffiHl we discuss two illustrative examples for which ( TT4T > 
holds. Lastly, condition ( fT3T > guarantees that the gradient does 
not explode, preventing the algorithm from making extremely 
large errors. 

It is not difficult to define sequences that satisfy ( Tf2l i and 
(fT~3T >. For example, setting = n _1 , = n~ 2a ~P and 
e M = n -°> with a, (3 > suffices. In particular, note that 
for a = /3 = 1/3, we have aW = -nT 1 and t^ n+ ^ - fM = 
n + 1, which expresses that the algorithm should take smaller 
steps as time increases, while simultaneously lengthening the 
observation period. 

The choices for a)- n \ e' n ' and p n ' strongly influence the 
behavior of the algorithm. Consider for instance the following 
two cases. Setting a [ ™l = n^ 1/2 ^ a with < a < 1/2 so that 
it barely satisfies (fT2l . allows us to let decrease as slowly 
as e [nl = n~ 1/2 . By (O we then need that < or 
we now consider the faster decreasing step 
size al"! = which also barely satisfies ( TTZb . we find that a 
much slower decreasing = n - " with < a « 1 suffices, 
implying by (O that /W < n" 2 " or fM _ i^-i] > n 2 Q is 
required. From these two cases, one sees that smaller step sizes 
allow for shorter observation periods (recall that < a -C 1). 
The search for optimal settings of a)- n \ and /M is an 
important topic for future research. 



III. Example applications 

We now discuss two example scenarios in which TheoremQ] 
can be applied. The first scenario concerns the optimal trade- 
off between performance and costs in an Erlang loss system. 
The second scenario considers a log-likelihood function as an 
objective function in combination with product-form stationary 
distributions. We should stress that these two examples, par- 
ticularly the first one, primarily serve to illuminate the core 
features of our algorithm in relatively simple settings. These 
scenarios are not meant to reflect the full scope or unique 
realm of our algorithm and could conceivably also be tackled 
via alternative methods. 



A. Optimizing service, cost trade-off 

Consider the M/M/s/s queue. Customers arrive according 
to a Poisson process with rate A and each customer has an 
exponentially distributed service requirement with unit mean. 
Each of the s parallel servers works at rate r. The steady-state 
probability of x G 51 = {0, 1, s} customers in the system 
is then given by 



(X/r) x /a 



ELo(A/r)Vw!' 



(16) 



The steady-state probability that an arriving customer finds 
all servers occupied and is blocked is given by the Erlang 
loss formula B(s,r) = 7T s (r). The mean stationary queue 
length is given by L(s,r) = J2x=i X7Tx ( r )> anc ^ by Little's 
law, L(s, r) = A(l - B(s, r))/r. 

Suppose now that we want to minimize B(s, r) by adjusting 
r and that the costs of operating at service rate r equal c(r). 
Assume c(r) to be convex in r and its derivative c'(r) to be 
bounded for all r £ 7Z. We thus aim to minimize u(r) = 
B(s,r) + c(r). This objective function is convex in r |23|. 
Furthermore, 



B(s,r)(L(s,r) - s) 



c'(r), 



(17) 



for which we prove the following result in Appendix! 



Lemma 1. If 11= [TZ min ,TZ maK ] with < TZ min < TZ aillx < 
oo and g(/J,,r) is given by d!71 >, then there exists constants 
c g ,Q € [0,oo) such that conditions (114b . d 1 5b hold for all 
probability vectors fj,, v and all r G 1Z. 

Using Lemma [T] we conclude that all conditions of Theo- 
rem Q] are met and that the gradient algorithm 



£[n+l] (£,[«+!] _ s ) 



+ ti 



■R 



converges to the optimal solution. Here, B^ n+1 ^ = fl\^ +1 ^ 
denotes an estimate of the loss probability and U n+1 ^ = 
J2x=i xflx l+1 ^ denotes an estimate of the mean queue length. 

B. Log-likelihood and product forms 

Consider the log-likelihood function as defined in (O as 
objective function. We prove the following result in Appendix 

m 

Lemma 2. If 7r(r) satisfies the product form (f3]l, then the 
log-likelihood function u(r) in (O is convex in r. 

Using du(n(r)) / 'dir x — ~a x /ir x and substituting (0 into 
(0 yields 

0i(ff(r)) = 5^0x(5^A,,i7r s (r)-A e ,i). (18) 

xen yen 

We will only consider a S (0, 1)1°' that are probability 
vectors, so that l|n| T £* = 1. We can then interpret (fT8l as 
the difference between the expectation with respect to 7r(r), 



denoted by (^4 T 7r(r))i = J2 y t=n A y ,iTr y (r), and the expecta- 
tion with respect to a, denoted by (A T a)i = Y^xen Ax.i&x, 
so that 



g(n(r)) = A T n(r) - A T a. 



(19) 



We assume that r° pt lies in the interior of 1Z, in which case 
optimality requires g(7r(r opt )) = and thus A T n(r opt ) = 
A T a. We call 7 = A T a the target vector, a name inspired by 
the fact that our algorithm seeks r opt such that A T 7r(r opt ) = 7. 

Because u(r) is convex in r and the target 7 is achieved 
by the solution r opt of (@), we want to use our online gradient 
algorithm (fTTT i to find r opt . From {19}, it follows that \gi{p) - 
gi(v) < 2max :Ci i{|A :Ci i|}||/x-i/|| var for i = 1, ...,d, and that 
\\g(fJ-,r)\\ 2 < lOlrfmax^ijlA^il}, so that (O and (O are 
satisfied. Using Theorem Q] we then arrive at the following 
result. 

Theorem 2. Given any 7 £ M. d for which there exists an 
r opt in the interior of 1Z so that A T 7r(r opt ) = 7, the online 
gradient algorithm 

?[«+!] _ \ R [n] _ a [«+l] (A T fr ln+11 

converges to r opt with probability one. 



H in+xj = r fllnj _ oln+1J ^Ttl 1 '^ 1 ' - j)] n (20) 



As an illustrative example, consider a loss network con- 
sisting of L links with capacities c = (ci,...,cl) t shared 
by if customer classes. Class-fc customers arrive according to 
a Poisson process with rate Afc and require exponentially dis- 
tributed holding times with mean 1//Xfc. Each class-fc customer 
requires capacity Bk,i on link I for the duration of its holding 
time, i.e. B^.i — bkJk,u where bk is the nominal capacity 
requirement of a class-fc customer and Jk.i has the value or 1, 
indicating whether the route of class-fc customers contains link 
I or not. When an arriving class-fc customer finds insufficient 
capacity available, it is blocked and lost. Denote the number 
of class-fc customers in the network at time t by Xk(t) and 
define X(t) — (Xi(t), Xr-(£)) t . Under these assumptions, 
{X(t)} t >Q is a reversible Markov process with state space 
il = {x £ N K \Bx < c} and steady-state probability vector 



T*(P) 



1 



K 



K 



En 

yen k=i 



(pk. 



ilk 



Vk ! 



Here, pk — Xk/[ik denotes the offered traffic of class fc. 
Rewriting gives 

1 K 

k x {p) = -jT\ cxp (E Xk lnpk ~ M^fcOj > ( 21 ) 



fc=i 



which matches (01 with d = K, ru = In p^, A,,,* = xt and 

fc K = -Efci ln ( a; 'c ! )- Note that (-4 T 7r(r))fc = Y.yenVk^y 
is the carried traffic of class fc, i.e. the steady-state average 
number of class-fc customers in the system, which we can 
empirically estimate by observing the system. We apply our 
algorithm by setting 



pl n+1 l = cxp 



([In/ 



[n+l] 



(A T fl 1 ' 



7)] 



■R 



(22) 



in order to adjust the amount of offered traffic p so as to 
achieve target carried traffic levels 7. In practice, network 
operators usually have limited control over the amount of 
offered traffic, but they can typically adjust route selections 
fairly easily so as to achieve target blocking levels for a given 
offered traffic volume. Variations of the above algorithm can 
be used in such scenarios but go beyond the scope of the 
present paper. 

In related work, Jiang and Walrand 1171 . [18| present an 
algorithm for achieving target throughputs in wireless CSMA 
networks. Their model can be interpreted as a special case of 
a loss network with unit link capacities. Their algorithm and 
convergence proof are therefore special cases of Theorem [2] 

IV. Convergence proof 

We will now prove Theorem Q] In iCV-AI we first explain 
our notion of convergence and then derive conditions on 
the error bias and zero-mean noise so that convergence is 
guaranteed. In ijlV-BI we show that under the assumptions of 
Theorem Q] the error bias and zero-mean noise indeed satisfy 
the conditions derived in i)IV-A| 

A. Conditions for convergence 

Theorem Q] states that Rs n ' converges to r° pt with proba- 
bility one. In order to prove that, we will establish that the 
following two properties hold for arbitrary S,e > 0. As our 
first property, we want that R^ comes close to r° pt infinitely 
often. We make this precise by requiring that for any 5 > 0, 
the set Hs = {r £ R d \u(r) < <i(r opt ) + 5/2} is recurrent for 
{i?^ } n gN' As our second property, we want that once R^ 
comes close to r opt , it stays close to r opt for all future itera- 
tions. Mathematically, we require that there exists an m £ N 
large enough so that \\R [n] - r opt ||^ < \\R [m] - r opt ||^ + e for 
all 71 > to, which we will call capture of R} n '. 

We shall relate both recurrence and capture to the error bias 
and zero-mean noise, defined as B' n ' = E[g' ^J 7 !™ -1 !] — 
G N and E [n] = _ E [£ N | j-[n-i]] > respectively. Here, 
j^n—i) denotes the cr-field generated by the random vec- 
tors Z^\Z^\...,Z^~ X \ where Z® = (R^,X(0)) T and 

Z [n] = (G [n] ,R [n] ,X(tW)) T for n > 1. 

1) Recurrence: We begin with deriving conditions under 
which the set Hs is recurrent for { R)- 7 ^ } n gN> using the 
following result. 

Lemma 3 (El, p. 115). Let {R [n] }» be an R d -valued 
stochastic process, not necessarily a Markov process. Let 
jjrM j ]j e a se q Uen ce of nondecreasing a-algebras, with 
measuring at least {R^ \i < n}. Assume that cJ n+1 ] are 
positive J 7 !"! -measurable random variables tending to zero 
with probability one and ^ n w- n ^ — 00 with probability one. 
Let V{r) > and suppose that there are 5 > and compact 
Hs C M. d such that for all large n and all r £" Hs, 

E[V(R [n+1] )\^ n] ] - V{R [n] ) < -a [n+1] S < 0. (23) 

Then the set Hs is recurrent for {i?' n ^}„>o in the sense that 
jfjn] g j or [ n fmitely many n with probability one. 



Before we can apply Lemma [3] we need to identify a suit- 
able function V(R [n+1] ). The choice D(R [n+1] ) = \\R [n+1] - 
r* opt 1 1 2 comes to mind as a candidate, and we will therefore 
investigate (|23j for L»(i? [n+l1 ). We will need the following 
result, the proof of which is relegated to S|C] 

Lemma 4. For x, y G K and K = [K min , K max ] C R, \[x] n - 
[y]n\ < \x-y\. 

Combining (fTTT i and Lemma H] gives 
D(R^) < ^|i?| nl - a[" +1 lG[ n+11 - r° pt | 2 

i=l 

= X>I n] - r° p T + (« [,i+1] ) 2 EI^" +11 l 2 
i=i i=i 

d 

- 2al" +1 l 2 Gf +1] (i?! n] - r° pt ). (24) 

»=i 

Substituting G W = G [n] +B M +E [n] into the last term, we 
conclude that 

D(R^) < D(RW) + (a[" +1 l) 2 ||G [n+11 ||! 
- 2a [n+1] (G [n+1] + B [n+1] + E [n+1] ) T (R [n] - r opt ). (25) 

Before we take the conditional expectation that results in 
a form similar to d23) . recall that u(r) is convex in r. We 
therefore have that ((24), P- 69) 

G [n+l]T^opt _ fl [n]j _ g( 7r (i?N),iiH)T( T .opt _ JjM) 

= V r u(i? [ " ] ) T (r opt - R [n] ) < u(r opt ) - u(R [n] ). (26) 

It follows that if R [n] $ Us, then G [n+llT (r°«* - i? w ) < 
—5/2. This gives in combination with (|25) a term — <W n+1 ', 
which we need for 423). We now note that E[ J E [n+llT (i2 w - 
r opt)|j-[n]j = 0> so t hat for R [n] # Us, 

E[D(R [n+1] )\T [n] ] - D(R [n] ) < -<5a [ " +1] + Y^ +1 \ (27) 
where 

y[n+i] = ( a [«+i])2 E [|| G [ " +11 ||2| J -N ] 

+ 2a[" +1 ]|E[B [n+llT (il [nl - r c, *)|.FM]|. (28) 

The upper bound in d2Tb is not yet of the form of the 
right-hand side in ([23). This implies that D(R [n+1] ) by 
itself is not an appropriate candidate for V(R.' in+1 ^). How- 
ever, we can modify it slightly so that it does satisfy ( 123) . 
For this, define A'"' = E[J2Zn+i yW I- 7 "" 1 " 1 ] and consider 
V(R [n+1] ) = D(R [n+1] ) + A[" +1 l instead. The difference 
E[A[" +1 ]|J"W] - AN is well-defined if J^Zi rW < 00 with 
probability one and is then equal to 

oo oo 

E[E[ V V r [ll |J" w ] = -Y [n+1] . (29) 

z— n+2 i—n+1 



We conclude that 

E[V(.R [n+11 )|.F [nl ] - y(flW) 

= E[D(R [n+1] )\^} - L>(i?["l) +E [a["+ 1 ]|J-N] _ A N 

= E[£>(fl [n+1] )|.F[ n ]] _ £)(i?N) _ < - ( 5 a ["+ 1 ]. 

(30) 

The upper bound in ( 130) is of the form of (|23) . meaning 
that we are almost ready to apply Lemma [3] What remains is 
to check whether 

oo oo 

E rN = E(° [nl ) 21E [ll 6w llll^ [n " 11 ] 

n— 1 n— 1 

oo 

+ 2^2a^\E[B^ T (R [n ~ 1] - r opt )\^ n -^}\ < oo (31) 
n=l 

with probability one. Since X^i( a '™') 2 < 00 and 11^" lh < 
c g by assumption, the first term is finite. Verifying that the 
second term is finite with probability one is much harder 
because it involves regularity conditions on g(n(r),r) and 
finiteness of mixing times. This can in fact be shown as stated 
in the next lemma, proved in m V-B 1 1 

Lemma 5. Under the assumptions of Theorem [7] the sum 
Y^ =1 a^\E[B [n]T {R [n - 1] - r opt )\F^-^]\ is finite with 
probability one. 

2) Capture: Having derived conditions under which Us is 
recurrent, we turn our attention to deriving conditions under 
which capture occurs. Recall that capture means that there 
must exist an m g N large enough so that D(R^) < 
_|_ e f or all n > m with probability one. 

After applying (|25) repeatedly and using the upper bound 
G [n]T( R [n-l] _ r o P t) > Qi which f n 0W s by convexity of 

u(r), we find that 

D(RW) < D(flH) + Y,( aU+1] ) 2 \\G U+1] Wl 

n 

- 2 £ a^(B^ + E^) T (R b] ~ r opt ). (32) 

j=m 

We now need to show that each sum in the right-hand side 
of ( |32) becomes small for m sufficiently large. Because 

X^i( a '™') 2 < 00 and 11^^112 < c g , it immediately 
follows that lim m ^ 00 ^ m ( a N)2|| G [ ,l ]|| 2 = Q ^ ^ 

this implies that for any e, there exists an mo 6 N so that 

YTj= m ( a[n] ) 2 \\G [n] h < £ for all n > m > m . Verifying 
that the other two sums become small is substantially more 
difficult. This can be established using martingale arguments, 
as asserted in Lemma [6] the proof of which is postponed to 
qiV-B2l 

Lemma 6. Under the assumptions of Theorem\l\ for any e > 

0, there exists mo £ N so that for any n > m > mo 

(i) E"= m a b] B MT (r opt - R [3 ~ 1] ) < e and 

(ii) E;= m a [ '' 1 £ [ '' ]T (r opt - < e 



with probability one. 

Our work thus far can also be used to prove that the gradient 
algorithm (|7|i converges. It is a special case of its stochastic 
counterpart (TT), for which B [n] = 0, E [n] = 0, G W = 
G [n] = and R [n] = for all n > 0. To prove that (Q 
converges, we apply ( f25l ) repeatedly and use that gr["] T (r opt — 
^1™-!]) < u (r opt ) — M(r[" _1 l) for any n G N by convexity of 
u(r), so that 



D(r [n] ) < D(r [0] ) 



^(a y+1 W +1] ii2 

3=0 



-2j2 ab+1] Hr [: > ] ) - u{r opt )). (33) 

3=0 

Noting that D(i-H) > for all neN and D(rM) < c r for 
some constant c r < oo since r' ' G 7?., we conclude that 



(u(r^) - w(r opt )) < c r + C 2 > , ( a U+ 1 )) 2 . (34) 



3=0 



3=0 



Since £™ =0 a b+1] (u(r^) - u(r opt )) > min 4=0 ,...,„{u(rW) - 
it(r opt )} 53j=o a ^ +1 'i we nave me inequality 



c g£j=oO 



,b+ih2 



mm {u(rW) - M (r° pt )} < ^^=0^'^ 



(35) 



which converges to as n — > oo. 

From this little detour we see that it is much easier to 
establish convergence for (|7]i than for its stochastic counterpart 
(flTT l. It is the error bias and zero-mean noise that make the 
convergence analysis of ( fTTT l so much harder. 

B. Evaluating the conditions 

We now provide the proofs of Lemma [5] and [6] which 
together prove Theorem Q] In our proofs, we choose to 
consider the error bias and zero-mean noise separately, which 
makes the analysis more tractable. 

1 ) Error bias: We start by showing that the error bias sat- 
isfies the property claimed in Lemma [5] under the assumptions 
of TheoremQ] After substituting the definition of the error bias 
and using the triangle inequality, one finds that 



J2 a W | E [B W T ( fll" - ^ - r opt ) | F [n ~ 1] ] | 

71=1 



E 

n=l 

oo 



a^\j2nBl n] \^ [n - i] ](R l - 



opt 



<E« [ 



71=1 



2=1 

d 

1=1 
d 

E 

i=i 



(n?™-nf n )\B\ n] \. 



(36) 



The inequality is a consequence of 7?. being a hypercube. We 
have also used the fact that Efsj" 1 IT^-^} = B W , which 



follows from the definition Gf l = g l (Tv(R [n 



I \n] I 

We now bound LB] J | from above. After recalling that 
Bj" 1 = E[G l ^ ] \^ n -^} - Gf ] and using Jensen's inequality, 
we find that | Bj™^ | equals 

|EI> i (ft W ,fl [n - 1 ])|^ n - 1 ]]-ffi(7r(H [n - 1] ),fl ln - 1] )| 

= |E[ 5l (n w ,i? [ "- 11 )~ ffl (7 r (H [ "- 11 ),fi [n - 11 )|j-[™- 1 ]]| 

<E[|5 i (ft [T,1 ,fl [n - 1] )-5i(ff(H [n - 1 ]),Jl [n - 11 )||^ n - 1 ]]. 
Recalling condition (TT4T > gives 

| B N| < |^ E [|nW - „ x (Rl n -V)\\Fl n -% (37) 



Finiteness of Q6j can now be proven by constructing an 
upper bound for ( f37l i. We can obtain such a bound using the 
following lemma, proved in Appendix [Pi 

Lemma 7. There exist c e , k G [0, oo) such that for G [0, 1] 
goto? a; G fi, 

F[|n^-.^^)|>eN]< Ce exp(-J^ 

Define = (ni" 1 - ir x (R [n - 1] )\ and let £ N G [0,1]. 
Using d37l i and then Lemma [7J yields 

|B|1<f EE^Nl^]] 



anGSl 



2 



^(p[$N < e M]E[*M|.F In - :l] ,$ 
+ P[$L" ] > eWjE^M^- 1 ],^ > e w ] 

- - E ( eM + c 1 ~ e[,il ) p [ $ i" 1 > eW i) 

( e M)2 



< e' 



sen 



c,|fi| 
< max 



{l,Ce}(e 



exp 



4|f7| 2 K/H 



(38) 



After bounding ( l36l l from above using ( 1381 ). it follows from 
JT3b that 

OO 

^ a M| E [ B MT( H [n-l] _ r op t) |^[„-l]j| < ^ (39) 

71=1 

which completes the proof of Lemma [5] 

We now show that the error bias satisfies assertion (i) in 
Lemma [6] under the assumptions of Theorem Q] Similar to the 
derivation of i 



^ a W| B NT (i? [n-l] _ r op t) | 
n=l 

oo d 

< a [n] E P 2 *"" - ^ nin ) I I ■ (40) 

n=l 7=1 

Combining (f40l i. < f38l > and ( fl3l >. we conclude that with proba- 
bility one, 



,R [ 



^ a N|_BNT (jR [n-l] _ r opt)| < ^ 



(41) 



so that lim m ^ 00 J2JL m a [n] B [n]T (r°P l - R [n - 1] ) = with 
probability one. This implies that there exists an too 6 N so 
that for all n > to > m , T,"= m B lj]T (r ? 1 - R b ~ 1] ) < e 
with probability one. The error bias thus satisfies assertion (i) 
in Lemma [6] All that remains is to show that the zero-mean 
noise satisfies Lemma |6jii). 

2) Zero-mean noise: We use a martingale argument to 
show that assertion (ii) in Lemma [6] holds. We start our 
argument by defining AfW = £"=i E b]T (R lj ~ 1] - r opt ). 
See Appendix [E] for a proof of the following result. 

Lemma 8. is a martingale. 



We will use a martingale convergence theorem [25 1 to show 
that for n > m both sufficiently large, M'"' - Ml" 1 " 1 ! < £ 
with probability one. 

Theorem 3. If{M^} is a martingale for which there exists a 
constant c m < oo so that E[(Af["l) 2 ] < c m for all n > 0, then 
there exists a random variable M° pt with E[(Af opt ) 2 ] < c m 
such that M [™1 — > M opt vw'f/z probability one as n —¥ oo. 
Moreover, E[|AfW - Af opt | 2 ]5 ->0fljn^oo. 

Before we can apply Theorem[3j we need to show existence 
of a c m G K such that E[(MW) 2 ] < c m for all n e N. To 
show this, expand 

n 

supE[(M [n] ) 2 ] =sup{^(a [ i ] ) 2 E[{E [ J ]T (R [ 3- 1] - r opt )) 2 ] 

+ ^a^a[ fc lE[£;M T (^'- 1 l -r^" 1 ^- 1 ' -r opt )]}, 

and then consider any one of the cross terms with k < j. By 
the tower property, 

E[E^ ]T (R^>- 1] - r opt )E [k]T (R [k - 1] - r opt )] 
= E[E[E U]T (R [j - 1] -r opt )E [k]T {R [k ~ 1] -r°v l )\T [j - 1] }] 

d 



= E[E^ T (R^ - r opt ) J2 E i E i ] l-^W" 11 " r T)l 

i=i 

and because E[E^ | J 7 '- 7-1 !] = 0, all cross terms are equal to 
0. Because the summands are positive, we can give an upper 
bound by summing over all terms, so that 

oo 

supE[(M w ) 2 ] < ^(a [j ' 1 ) 2 E[(^ y]T (J? y - 11 -r opt )) 2 ] 



3 = 1 

oo d 

= j2(^ ] ) 2 n(E E i ] ( R t 1] -Of 

3=1 »=1 

Using the triangle inequality, we find that 



(42) 



Now note that 2~2t=i\ E i ] \ = ll^lli. write 

p^Hi <E[||G [j1 ||iI^' [j " 11 ] + I|G' [j1 ||i 
< VdE[\\G b] \\ 2 \^-^} + Vd\\G b] \\ 2 < 2c g ^d (43) 
and recall that 1Z is a hypercube. We conclude that 

oo 

supEpfM) 2 ] < Acid max {{11™* - Tlf n ) 2 } V (a^) 2 . 

n B i=l,. ..,d ' 

3 = 1 

The right-hand side is finite by condition ( fT2b . and we see that 
there indeed exists a coefficient c m so that E[(Af["l) 2 ] < c m 
for all n £ N. We now apply Theorem [3] and conclude that as 

n > m — > oo, 

E[|M [ ™ ] - M [m - 1] | 2 ]5 < E[|A/ W - M opt | 2 ]5 

+ E[|M I™" 1 ! - Af opt | 2 ]5 -> 0. (44) 

This result enables us to use Doob's maximal inequality 
ll25l . as reproduced in the lemma below, in order to conclude 
that Lemma |6jii) holds. 

Lemma 9. If {Af[™l}„>o is a nonnegative submartingale and 
A > 0, then 

AP[sup Af[™l > A] < E[Af["]l[sup > A]] < E[M^]. 

m<n m < n 

Fix to e N and define = M^ n+m ~^ - M^ 11 ^ 

for n g N. |Wl n J| is a submartingale by Jensen's inequality 
with respect to the sequence jr[m-i] jr[m]^ jrlm+i]^ ( since 
E[|VK [n+1 ]||J" [,l+m_1 ]] > E[VF[™ +1 ]|J r [™ +m_1 l] = WH 
Applying Lemma [9] to |Wl n l|, we find that 



'[ sup \M 

0<t<n 



< 



< 



< 



[t+m-1] _ ^[ m -!]| > \l 
E [| M [n+m-l] _ J^[m-l]|] 

A 

E p f [«+m-l] _ M opt|] + E [| M opt _ 

A 

E p f [n+m-l] _ M opt|2]i + Ep/opt _ M [m-l]|2]| 

A 



su P E[(Af[«i) 2 ] ^^(^mE^i i^ 11 - r ri) 

n , ■ , 

3=1 i=i 



for any A 6 (0, 00) and m € N. This upper bound converges 
to as n, m — > 00, implying that there exists an too € N, 
such that for all n > m > to , AfW - A/I" 1 " 1 ! < e with 
probability one. 

Having established Lemma [5] and [6] the proof of Theorem[T] 
is now completed. 

V. Conclusions 

We have developed an online gradient algorithm for finding 
parameter values that optimize the performance of reversible 
Markov processes with product-form distributions. As a key 
feature, the approach avoids the computational complexity of 
calculating the gradient in terms of the stationary probabilities 
and instead relies on measuring empirical time fractions of 
the various states so as to obtain estimates for the gradient. 
While the impact of the induced measurement noise can be 
handled without too much trouble, the bias in the estimates 



presents a trickier issue. In order to exploit mixing time results 
to deal with the bias, we focussed on reversible processes. 
We expect however that convergence can be established under 
milder conditions. 

For fast convergence, the algorithm needs to strike a balance 
between the step sizes and the lengths of observation periods, 
which is a consequence of the existence of two time scales 
- one being the mixing time of the underlying stochastic 
process and the other being the iteration sequence generated 
by the algorithm. Intuitively, the step sizes should not have 
become too small by the time that the observation periods 
have become larger than the mixing time. The convergence 
of the algorithm would otherwise slow down drastically. A 
challenging issue for further research is to gain a more detailed 
understanding of the effect of step sizes and the role of mixing 
times in relation to the convergence speed. A related direction 
is to explore the trade-off between accuracy in static scenarios 
and responsiveness in dynamic environments, which relates 
to convergence in distribution for non-vanishing step sizes as 
opposed to the almost-sure convergence for decreasing step 
sizes as considered here. 
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Appendix 

A. Proof of Lemma [7J 

Define = fi s and = f° r a ^ A* G [0, 1]'° 

for which l|o| T /^ = 1. By definition of g(fi, r), \\g(fJi, r)\\2 < 
|-B M (L M — s)\/r+ \c'(r)\ < oo. The first term is finite because 
r > K min > 0, < 1 and L M < s < oo. The second term 
is finite by our assumption that c'(r) is bounded for all r £ 1Z. 
This proves that condition ( fl5l > is met. 

We now turn to condition JT4V Write \g(fJ,, r) — g(v. r)\ — 
\B^{L„ - s) - B V {L V - s)\/r < {B^L^ - sB^ - B V L V + 
sB u \/K min < QB^L^ - B V L V \ + s\B„ - B„\)/K min . We 
then conclude that {B^L^ - B V L V \ = {B^L^ - B^L^ + 
B ll L v — B V L V \ < B^Lfj, — L„| + L^B^ — B v \ < |L M — 
L u \+s\B^-B u \, so that \g(p, r) - g(u, r)\ < (\L fJl -L u \ + 
2s\B^-B„ \)/K min . Finally, by definition of S M , {B^-B^l = 
I Ms - v s\ < 2| ]/x - H|var- Similarly for L^, |i M - L v \ < 
J2l=i x \^-Vx\ < 2s||/*-i/|| var . Thus \g(fi,r)-g(u,r)\ < 
6s 1 1 /x — i>>| Ivnr /7Z mm , which concludes the proof after setting 
ci = 6s/ft min . □ 

B. Proof of Lemma |2] 
Substituting (O into (fJJ gives 

u(r) = In ^ exp (Ar + b) y -J2a x (Ar + b) x . (45) 

The function v(s) ~ In J2 y en ex P s v ~ X^eS! a xS x is convex 
on R' n l l24l . p. 72. We see that u(r) is a composition of a 
convex function with an affine mapping, i.e. u(r) = v(Ar+b), 
and such functions are convex |24l . p. 79. □ 



C. Proof of Lemma [4] 

Define / = TZ min and r = TZ max . If x, y £ TZ, equality holds. 
Consider the case x TZ, y £ TZ. If x > r, \[x]n — [y]n\ = 
\r - y\ = r - y < x - y = \x - y\. If x < I, \[x] n - [y] n \ = 
\l — y\ = y — I < y — x = \x — y\. Finally, consider the case 
x,y g TZ. If x,y > r or x,y < I, \[x]n~ [y]n\ = < \x-y\. 
If x >r,y <l, \[x\n-[y\n\ = \r-l\ = r-l < x-y = \x-y\. 
The case x < l,y > r follows from a similar argument. □ 

D. Proof of Lemma 

Let Var M [/] = \Y,*, y &n{f( x ) ~ f(v)) /^/V = 
Exen f( x )9 ( X )V>* and llMlk* = (Exefi vl"x) 1/2 - 

Proposition 1 ((26), p. 2). On some Polish space fi, let 
us consider a conservative (continuous-time) Markov process 
denoted by {X(t)} t >a and with infinitesimal generator C. 
Let p be a probability measure on which is invariant and 
ergodic with respect to Pp 

Assume that p satisfies the Poincare inequality Var /X [/] < 
—n(Cf, f)fj,. Then for all 9 such that sup \9\ = 1, all < e < 
1 and all t > 0, assuming that the initial distribution of X s 
is v, 



< 



1 

dv 
dp. 



9{X{s))ds - J 9dp 
( te 2 



> e 



2,// 



exp i 



8/cVar M [0] 



(46) 



Lemma [7] is a direct consequence of Proposition Q] Before 
we can use Proposition Q] to prove Lemma |7l however, we 
need to verify all of its assumptions. We will now verify these 
assumptions for continuous-time, reversible Markov processes 
with a product form solution. Our method is based on an 
approach for discrete-time Markov chains |fl5l . 

Define a graph G — (V,E), where V denotes the vertex 
set in which each vertex corresponds to a state in f2 and E 
denotes the set of directed edges. An edge e = (x, y) is in E if 
</>(e) = Tr x Qx,y — KyQy,x > 0. Here, Q denotes the generator 
matrix of {X(t)}t>o- For every pair of distinct vertices x, y £ 
il, choose a path r y x ^ y (along the edges of G) from x to y. 
Paths may have repeated vertices but a given edge appears at 
most once in a given path. Let Y denote the collection of paths 
(one for each ordered pair x,y). Irreducibility of {X(t)} t >o 
guarantees that such paths exist. For j XtV £ T define the path 
length by \\j x , y U = Eee^J 1 /^))- Also ' let 

K = max E II 

^x,y II (jy^x^y 

(47) 

{7*,y6r|ee7 ;c , H } 

and /(e) = f(y) — f(x) for e = (x,y) £ E. Then write 

V^4E(E(iW^, (48) 
Use the Cauchy-Schwarz inequality |a; T y| 2 < x T x ■ y T y to 



obtain 
Var-jr 



x,y€.Q e 67x,y e 67x, y 

= \ E 7r ^ 7r aii7x, y iu( E <^( e )/( e ) 2 ) 

= ^^2^(e)f(e) 2 E hx,y\U^y 
{7x,j,er|ee7 x , B } 



eeE 



Use the definition of k and the symmetry of <f>(e) to write 
Vax w [/]<|E>(e)/(e) 2 

eeE 

= \ E *yQv,x(f(y) 2 -f(y)f(*)) 



+ f E ^Q«,«(/(*) a - /(»)/(*)) 

= « E Qx : y(f(x) - f{y))f(x)TT x 



= k e (E ^^(/(^ - /(»))) ( 49 ) 

By definition of the infinitesimal generator C, we find that 

(£/)(*) = lim i(E (e tQ ) X)V /(y) - /(*)) 

yen 

= E ( 3 x *^^) = E Qx,yf(y) + Qx,xf(x) 
yen yen\{ x } 

= E Qx, y f(y)- E Q*,yf( x ) 

y en\{x} yen\{ x } 

= Y,Q*Mv)-H x ))> (50) 

yen 

after which one can conclude that Var .„■[/] < — K(Cf,f) n . 
We also note that when choosing 9(X(t)) — \\X(t) = z], we 
have that 



Var, 



M = \ E C 1 ^ = z i - 1 i» = z 2 ^^ < 



2 



Now starting from any state y, i.e. the probability distribution 
with unit mass in state y, we have for the initial distance 



dv 
dp 



2 ^ v ^ v ^ 



< 



since p, = tv. Because TZ is bounded, min^gn is bounded 
from below by some constant l/c c £ (0, oo). □ 



E. Proof of Lemma \8\ 

First note that M<- n ' 6 J 7 '™' and that its expectation is 
bounded, which can be concluded after writing 

E[\M^}<j2^ ] n\J2 E i ] ( R t 1] -rT)\} 

d 

M i 

3=1 

and then substituting (|43T >. Also, 



< max^™ - TZf n }E[J2\E\ j] |] (51) 



E [ M N|jr[«-i]] = E[^aWs b ' lT (i2 b '- 1] - r opt )| J"!"- 1 !] 
j'=i 

= Afl™- 1 ! + a ["]E[£; [ ™ ]T (H [n - 11 - r opt )\T^-^} = M^'^, 
which concludes the proof. □ 



